-
Notifications
You must be signed in to change notification settings - Fork 127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve custom transpilation for faster 1Q/2Q RB #922
Improve custom transpilation for faster 1Q/2Q RB #922
Conversation
|
||
# The following methods are used for RB with more than 2 qubits | ||
def _sample_circuits(self): | ||
self._cliff_utils = CliffordUtils( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be promoted to actual generator by including _sample_sequences
and _sequences_to_circuits
? I am implementing another RB variant, that partly uses Clifford sequence but final structure is totally different from standard RB (thus no inheritance). I don't want to implement boilerplate code in such a class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point, but let me go there (having RBCircuitGenerator class to delegate whole circuit generation) step by step. I'd like to focus on clean-up of Standard/InterleavedRB implementation in this PR, then clean-up of clifford_utils
in the next PR (I'll remove the creation of CliffordUtils objects there). I prefer further refactoring should come after them.
if self.num_qubits in [1, 2]: | ||
transpiled = self._layout_for_rb() | ||
has_custom_transpile_option = ( | ||
any( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realized current implementation induces a serious issue when benchmarking the user-calibrated gate. Since conventional sequence implicitly triggers the pulse gate pass in the default pass managers that pulls gate calibrations from the backend instmap that doesn't explicitly show up in the transpile option. This is indeed the breaking API change, because custom calibration will be just ignored but user will never notice this. I think you need to explicitly manage inst map without using pass manager (for performance reason). Alternatively, we can assume V2 backend, where calibration data is always provided with gate entry. Note that this PR will be merged soon. This makes me think we need unittest for missing calibrations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in cee4e78
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to have warning or completely remove transpile options, i.e overriding _set_transpile_options
method to raise warning/error. Usually users are not aware of what is happening inside .run()
and may set some random option from random example code. I think having inst_map
only makes sense in this experiment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion. I like the idea restricting acceptable transpile options for each experiment. (too many possible transpile_options
was one of my headaches.) For RB experiments, I don't like to but I'll accept basis_gates
and optimization_level
for now as well as inst_map
. Because they are widely used in the tests (Fortunately, they seem not to be used in the tutorial). Forbidding them would require large refactoring of tests, so if we want that, I think it should be done in a separate PR.
7f2396f
to
0eac7f1
Compare
55604f7
to
d34bc12
Compare
0eac7f1
to
492fa1f
Compare
492fa1f
to
36b705f
Compare
This PR is now ready to be reviewed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am submitting the comments I have up to here, so that you can have a look at them. I will continue to review, probably only on Thursday, because of our holiday.
qiskit_experiments/library/randomized_benchmarking/rb_experiment.py
Outdated
Show resolved
Hide resolved
qiskit_experiments/library/randomized_benchmarking/rb_experiment.py
Outdated
Show resolved
Hide resolved
qiskit_experiments/library/randomized_benchmarking/rb_experiment.py
Outdated
Show resolved
Hide resolved
qiskit_experiments/library/randomized_benchmarking/rb_experiment.py
Outdated
Show resolved
Hide resolved
qiskit_experiments/library/randomized_benchmarking/clifford_utils.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @itoko san. Looks almost good to me. Mainly nitpicks.
qiskit_experiments/library/randomized_benchmarking/rb_experiment.py
Outdated
Show resolved
Hide resolved
qiskit_experiments/library/randomized_benchmarking/rb_experiment.py
Outdated
Show resolved
Hide resolved
qiskit_experiments/library/randomized_benchmarking/rb_experiment.py
Outdated
Show resolved
Hide resolved
transpiled = super()._transpiled_circuits() | ||
if self.analysis.options.get("gate_error_ratio", None) is None: | ||
else: | ||
transpiled = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think worth having multi-threading here? Note that the standard Qiskit transpiler uses qiskit parallel and this is single thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think it's worth investigating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've examined parallel transpilation using qiskit.tools.parallel.parallel_map
and found that it's much slower than serial one. I guess that is due to the overhead of pickling circuits. Let me keep the current code as is.
Examined code for parallel transpilation:
circuits = self.circuits()
if (
len(circuits) > 1
and os.getenv("QISKIT_IN_PARALLEL", "FALSE") == "FALSE"
and parallel.PARALLEL_DEFAULT
):
# Custom transpilation in parallel
transpiled = parallel.parallel_map(
_transpile_clifford_circuit,
circuits,
task_kwargs={"physical_qubits": self.physical_qubits},
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good to know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe because I'm using Mac (I somewhere heard parallel does multi-processing so slow in Mac)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On mac by default parallel_map
won't actually run in parallel on Python >= 3.8 (https://github.com/Qiskit/qiskit-terra/blob/main/qiskit/tools/parallel.py#L61-L83 defines the defaults) because in Python 3.8 they switched macOS multiprocessing
to use spawn
instead of fork
by default because of some platform bugs. Spawn has a lot of extra overhead and it's also unreliable for user code because you need a if __name__ == "__main__"
guard many times to distinguish a child process from the parent so we disable multiprocessing by default on macOS.
In general multiprocessing in python can be very tricky to work with and get a performance boost. Although for transpile()
with linux (so using fork()
) with involved compilation the performance is typically better even with the pickling and IPC overhead of separate processes (with some caveats Qiskit/qiskit#7741 )
qiskit_experiments/library/randomized_benchmarking/rb_experiment.py
Outdated
Show resolved
Hide resolved
qiskit_experiments/library/randomized_benchmarking/interleaved_rb_experiment.py
Show resolved
Hide resolved
interleaved_circ.name = f"Clifford-{interleaved_circ.name}" | ||
# Transpile circuit with non-basis gates and remove idling qubits | ||
try: | ||
interleaved_circ = transpile( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need full transpile here? Likely you just need to call basis translator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I think it was for delays but it may be unnecessary now since I updated the handling of delays in 36193c9
Let me check if we can replace transpile with basis translation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found it's possible but it would be difficult to explain the spec in doc (since we'd need to mention to a protected method _transform_clifford_circuit
). Now I think we should revisit here when we redesign the basis translation scheme (removing the use of transpile
for 1Q/2Q RB) in the future.
qiskit_experiments/library/randomized_benchmarking/clifford_utils.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I completed the review. Overall, I think the code looks very nice. I also like the split you made in test_randomized_benchmarking
.
qiskit_experiments/library/randomized_benchmarking/clifford_utils.py
Outdated
Show resolved
Hide resolved
qiskit_experiments/library/randomized_benchmarking/interleaved_rb_experiment.py
Outdated
Show resolved
Hide resolved
qiskit_experiments/library/randomized_benchmarking/interleaved_rb_experiment.py
Outdated
Show resolved
Hide resolved
qiskit_experiments/library/randomized_benchmarking/interleaved_rb_experiment.py
Show resolved
Hide resolved
qiskit_experiments/library/randomized_benchmarking/interleaved_rb_experiment.py
Show resolved
Hide resolved
qiskit_experiments/library/randomized_benchmarking/interleaved_rb_experiment.py
Outdated
Show resolved
Hide resolved
qiskit_experiments/library/randomized_benchmarking/interleaved_rb_experiment.py
Outdated
Show resolved
Hide resolved
Regarding the
I am not sure exactly where this comes from, but it seems like level 0 does not do a good enough job at transpiling the Cliffords. |
Benchmarking with https://gist.github.com/itoko/f078218ea6458f32d1c0f9be827f614f main: Main branch (0359f4c) method = time_transpiled_circuits for both settings Standard RB 1Q:
Standard RB 2Q:
Interleaved RB 1Q:
Interleaved RB 2Q:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved with minor suggestions. They don't block merging.
qiskit_experiments/library/randomized_benchmarking/interleaved_rb_experiment.py
Outdated
Show resolved
Hide resolved
circ = QuantumCircuit(self.num_qubits) | ||
circ.barrier(qubits) | ||
circ.append(Barrier(self.num_qubits), circ.qubits) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is not addressed. Of course you can do this in the follow-up.
de2de95
to
9142df4
Compare
* Improve custom transpilation for faster 1Q/2Q RB with simplifying code structure * Avoid using general Gate objects * Fix handling of interleaved delays * Add custom calibrations support in transpiling circuits * Add more comments * Small cleanups * More precise retrieve of basis gates * Update qiskit_experiments/library/randomized_benchmarking/rb_experiment.py Co-authored-by: Naoki Kanazawa <[email protected]> * Renames and updates following review comments * Remove experiment_type from circuit metadata * Add backend V1 to V2 conversion * Fix docs following review comments * Separate a code block as a private method * Change to create BackendTiming object outside loop Co-authored-by: Naoki Kanazawa <[email protected]>
* Improve custom transpilation for faster 1Q/2Q RB with simplifying code structure * Avoid using general Gate objects * Fix handling of interleaved delays * Add custom calibrations support in transpiling circuits * Add more comments * Small cleanups * More precise retrieve of basis gates * Update qiskit_experiments/library/randomized_benchmarking/rb_experiment.py Co-authored-by: Naoki Kanazawa <[email protected]> * Renames and updates following review comments * Remove experiment_type from circuit metadata * Add backend V1 to V2 conversion * Fix docs following review comments * Separate a code block as a private method * Change to create BackendTiming object outside loop Co-authored-by: Naoki Kanazawa <[email protected]>
Improve the performance of transpiled circuit generation in 1Q/2Q StandardRB/InterleavedRB (about 10x speedup in 1Q SRB/IRB and 5x speedup in 2Q SRB/IRB in most cases). Including following feature pull-requests: * New algorithm for RB building Cliffords by layers (#892) * Improve custom transpilation for faster 1Q/2Q RB (#922) * Improve integer-based Clifford operations for 1Q/2Q RB (#940) * Readd support for backends with directed basis gates to RB experiments (#960) Features other than speedup: * Fix performance regression in 3 or more qubits RB (embedded at Refactor RB module for future extensions #898) * Improve validation of interleave_element in InterleavedRB * Restructure tests for RB module * Remove the barriers in front of the first Cliffords in RB circuits Co-authored-by: merav-aharoni <[email protected]> Co-authored-by: Naoki Kanazawa <[email protected]>
Summary
This PR is a follow-up of #892 for further performance improvement with several clean-ups suggested in the review of PR 892.
Details and comments
The drastic performance improvement in 1Q/2Q RB (PR 892) was mostly come from two ideas:
transpile
)PR 892 implements more than those two (including some not contributing to the performance improvement) and they make difficult to track what really improved the performance and seem overly reduce maintainability of the code.
This PR cleans up the implementation of Standard/InterleavedRB without performance regression and add further optimization in the implementation of (1) custom transpilation with following several suggestions given in PR 892. For insetances,
StandardRB._layout_for_rb
toclifford_utils._transpile_clifford_circuit
and add further optimization.(See also the original draft PR to roughly understand which code blocks are essentially changed for above updates)
Other features added in this PR includes:
This PR achieves further speed-up (1.2x for 1Q StandardRB, 1.1x for 2Q StandardRB, 3.4x for 1Q InterleavedRB, 1.6x for 2Q InterleavedRB) measured by this benchmark.
This PR will be followed by one more PR that cleans up
clifford_utils
, which is left with minimal updates in this PR.